{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i5w5Oktvr7bC"
      },
      "source": [
        "\n",
        "\n",
        "<div align=\"center\">\n",
        "<p align=\"center\">\n",
        "\n",
        "\n",
        "# 🚀 Synthetic Data Generator\n",
        "\n",
        "</p>\n",
        "</div>\n",
        "The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data.\n",
        "\n",
        "Synthetic data does not contain any sensitive information, yet it retains the essential characteristics of the original data, making it exempt from privacy regulations such as GDPR and ADPPA.\n",
        "\n",
        "High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OikMoGinr7bF",
        "outputId": "3ce73c4d-c7e5-4160-b912-0bc78d6d9a11"
      },
      "outputs": [],
      "source": [
        "# install from git\n",
        "!pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NkzQVyvRr7bH"
      },
      "source": [
        "We demonstrate with a single table data synthetic example."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "a1JTesuSr7bH"
      },
      "outputs": [],
      "source": [
        "from sdgx.data_connectors.csv_connector import CsvConnector\n",
        "from sdgx.data_loader import DataLoader\n",
        "from sdgx.data_models.metadata import Metadata"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hNCC1XASr7bH"
      },
      "source": [
        "# 1. Load data\n",
        "\n",
        "The demo data set for this demonstration is a risk control data set used to predict whether an individual will default on a loan. This dataset contains the following features:\n",
        "\n",
        "| Column name | Meaning |\n",
        "|-----------------------|-----------------------|\n",
        "| loan_id | loan ID |\n",
        "| user_id | user ID |\n",
        "| total_loan | Total loan amount |\n",
        "| year_of_loan | Loan period |\n",
        "...\n",
        "\n",
        "This code shows the process of loading real data:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Fo4RAlowr7bH",
        "outputId": "7c82f12e-0ded-43ca-fb67-459bc5022052"
      },
      "outputs": [],
      "source": [
        "# In the future, this part of the function will be integrated into `sdgx.processor`\n",
        "import os \n",
        "import requests\n",
        "\n",
        "\n",
        "def download_file(url, path):\n",
        "    response = requests.get(url)\n",
        "    if response.status_code == 200:\n",
        "        with open(path, 'wb') as file:\n",
        "            file.write(response.content)\n",
        "        print(f\"File downloaded successfully to {path}\")\n",
        "    else:\n",
        "        print(f\"Failed to download file from {url}\")\n",
        "\n",
        "# download dataset from github\n",
        "# This datajset can be downloaded through sdgx.utils \n",
        "dataset_url = \"https://raw.githubusercontent.com/aialgorithm/Blog/master/projects/一文梳理风控建模全流程/train_internet.csv\"\n",
        "file_path = 'train_internet.csv'\n",
        "\n",
        "if not os.path.exists(file_path):\n",
        "    download_file(dataset_url, file_path)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "x5t8M4E6r7bI"
      },
      "outputs": [],
      "source": [
        "from pathlib import Path\n",
        "\n",
        "path_obj = Path(file_path)\n",
        "\n",
        "# Create a data connector and data loader for csv data\n",
        "data_connector = CsvConnector(path=path_obj)\n",
        "data_loader = DataLoader(data_connector)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 2. Create Metadata from Dataloader\n",
        "\n",
        "sdgx supports creating metadata from pd.DataFrame or DataLoader, and also supports creating metadata from scratch from blank metadata (but this method is not recommended because it is more troublesome).\n",
        "\n",
        "In this example, we use `from_dataloader` to create the first Metadata."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "\u001b[32m2024-03-25 17:10:54.848\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36msdgx.data_models.metadata\u001b[0m:\u001b[36mfrom_dataloader\u001b[0m:\u001b[36m280\u001b[0m - \u001b[1mInspecting metadata...\u001b[0m\n",
            "\u001b[32m2024-03-25 17:10:58.441\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36msdgx.data_models.metadata\u001b[0m:\u001b[36mupdate_primary_key\u001b[0m:\u001b[36m482\u001b[0m - \u001b[1mPrimary Key updated: {'loan_id', 'user_id'}.\u001b[0m\n"
          ]
        }
      ],
      "source": [
        "loan_metadata = Metadata.from_dataloader(data_loader)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s first understand some common member variables in Metadata.\n",
        "\n",
        "The most important and commonly used method is to use `column_list` to view column information. This variable returns a list. The order of columns corresponds to the order of the actual table."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['loan_id',\n",
              " 'user_id',\n",
              " 'total_loan',\n",
              " 'year_of_loan',\n",
              " 'interest',\n",
              " 'monthly_payment',\n",
              " 'class',\n",
              " 'sub_class',\n",
              " 'work_type',\n",
              " 'employer_type',\n",
              " 'industry',\n",
              " 'work_year',\n",
              " 'house_exist',\n",
              " 'house_loan_status',\n",
              " 'censor_status',\n",
              " 'marriage',\n",
              " 'offsprings',\n",
              " 'issue_date',\n",
              " 'use',\n",
              " 'post_code',\n",
              " 'region',\n",
              " 'debt_loan_ratio',\n",
              " 'del_in_18month',\n",
              " 'scoring_low',\n",
              " 'scoring_high',\n",
              " 'pub_dero_bankrup',\n",
              " 'early_return',\n",
              " 'early_return_amount',\n",
              " 'early_return_amount_3mon',\n",
              " 'recircle_b',\n",
              " 'recircle_u',\n",
              " 'initial_list_status',\n",
              " 'earlies_credit_mon',\n",
              " 'title',\n",
              " 'policy_code',\n",
              " 'f0',\n",
              " 'f1',\n",
              " 'f2',\n",
              " 'f3',\n",
              " 'f4',\n",
              " 'f5',\n",
              " 'is_default']"
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "loan_metadata.column_list"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 3. Use Inspectors to automatically label column types\n",
        "\n",
        "Currently, when a sdgx's Metadata module pd.DataFrame or DataLoader is created, it will load Inspectors, automatically scan some data (not all data), and label the columns in the table according to the logic of each Inspector.\n",
        "\n",
        "Currently, we support automatic inference of multiple data types, and sdgx supports the following basic types:\n",
        "- bool\n",
        "- int\n",
        "- float\n",
        "- datetime\n",
        "- discrete\n",
        "- id\n",
        "\n",
        "Basic data types guarantee that each column will be labeled to one of the data types.\n",
        "\n",
        "sdgx also supports the following data types, and the sdgx team will continue to add data types:\n",
        "\n",
        "- english_name\n",
        "- email\n",
        "- china_mainland_mobile_phone\n",
        "- china_mainland_id\n",
        "- china_mainland_postcode\n",
        "- unified_social_credit_code\n",
        "- china_mainland_address\n",
        "- chinese_name\n",
        "\n",
        "If you need to query a column of a certain data type, you can access it through `.{column_name}_columns`, for example: access the datetime column through `.datetime_columns`, and access the english_name column through `english_name_columns`.\n",
        "\n",
        "For example, we can access discrete columns through `.discrete_columns`, which will return a set containing the set of column names that are considered **discrete** columns.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2ILAVWcEr7bI",
        "outputId": "5da2b0cf-ccec-4d61-b982-fd410275eb12"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'class',\n",
              " 'earlies_credit_mon',\n",
              " 'employer_type',\n",
              " 'industry',\n",
              " 'issue_date',\n",
              " 'sub_class',\n",
              " 'work_type',\n",
              " 'work_year'}"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# Automatically infer discrete columns\n",
        "loan_metadata.discrete_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Similarly, we can view `int_columns`, `bool_columns` and other columns as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "set()"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# No Bool columns in current tabular data.\n",
        "loan_metadata.bool_columns"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'censor_status',\n",
              " 'del_in_18month',\n",
              " 'early_return',\n",
              " 'early_return_amount',\n",
              " 'early_return_amount_3mon',\n",
              " 'f0',\n",
              " 'f1',\n",
              " 'f2',\n",
              " 'f3',\n",
              " 'f4',\n",
              " 'f5',\n",
              " 'house_exist',\n",
              " 'house_loan_status',\n",
              " 'initial_list_status',\n",
              " 'is_default',\n",
              " 'loan_id',\n",
              " 'marriage',\n",
              " 'offsprings',\n",
              " 'policy_code',\n",
              " 'post_code',\n",
              " 'pub_dero_bankrup',\n",
              " 'recircle_b',\n",
              " 'region',\n",
              " 'scoring_high',\n",
              " 'scoring_low',\n",
              " 'title',\n",
              " 'total_loan',\n",
              " 'use',\n",
              " 'user_id',\n",
              " 'year_of_loan'}"
            ]
          },
          "execution_count": 9,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# check int columns\n",
        "loan_metadata.int_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Use `datetime_columns` to view datetime types, but note that datetime type needs to add formats before data processing, **datetime formats need to completely correspond to datetime columns**. \n",
        "\n",
        "For specific operations, please refer to the manual interface of metadata below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'earlies_credit_mon', 'issue_date'}"
            ]
          },
          "execution_count": 10,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "loan_metadata.datetime_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "⚠️ It is worth noting that Inspectors work well in most cases, but all types in tabular data may not be fully covered, or there may be incomplete coverage.\n",
        "\n",
        "Therefore, before proceeding to the next step of training the model or further processing the data, we still recommend that data analysts **check** all the labeling of data types."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 4. Understand the inspect_level mechanism in Metadata\n",
        " \n",
        "Since Metadata will run multiple Inspectors when it is created, the same data column may be labeled multiple times. For example, a column is marked as PostCode and discrete at the same time. In fact, this column is a post code column. \n",
        "\n",
        "From this, we use `inspect_level` to solve this problem. Different inspectors have different inspect levels, and the final mark of the final column is determined by the mark with the higher inspect level."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "defaultdict(<function sdgx.data_models.metadata.Metadata.<lambda>()>,\n",
              "            {'id_columns': 20,\n",
              "             'int_columns': 10,\n",
              "             'float_columns': 10,\n",
              "             'bool_columns': 10,\n",
              "             'datetime_columns': 20,\n",
              "             'discrete_columns': 10,\n",
              "             'china_mainland_mobile_phone_columns': 30,\n",
              "             'unified_social_credit_code_columns': 30,\n",
              "             'china_mainland_id_columns': 30,\n",
              "             'china_mainland_postcode_columns': 20,\n",
              "             'email_columns': 30})"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# ·column_inspect_level· records the inspect_level values of all inspectors\n",
        "# the default inspect_level is 10 \n",
        "loan_metadata.column_inspect_level"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 5. Metadata manual interface\n",
        "\n",
        "Metadata supports the following manual interfaces, which can finely modify column labels one by one according to your ideas:\n",
        "- query: Query the tag of a certain column.\n",
        "- get: Get all tags by key.\n",
        "- set：Set tags, will convert value to set if value is not a set.\n",
        "- add: Add tags.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'loan_id'}"
            ]
          },
          "execution_count": 12,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "loan_metadata.set('id_columns', {'loan_id'})\n",
        "\n",
        "loan_metadata.id_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Note that currently only the datetime type needs to `add formats`, and before data processing, datetime formats need to completely correspond to datetime columns (otherwise the column will be deleted during the data preprocessing process), other data types do not need it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "defaultdict(str, {})"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# datetime_format has no content, which will cause an error in the subsequent process.\n",
        "loan_metadata.datetime_format"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The above four basic methods only apply to columns.\n",
        "\n",
        "For the dict type datetime format, it is recommended to assign values directly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'issue_date': '%Y-%m-%d', 'earlies_credit_mon': '%b-%Y'}"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "datetime_format = {\n",
        "    'issue_date': '%Y-%m-%d',\n",
        "    'earlies_credit_mon': '%b-%Y'\n",
        "}\n",
        "loan_metadata.datetime_format = datetime_format\n",
        "\n",
        "loan_metadata.datetime_format"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 6. Get the exact data type of each column\n",
        "\n",
        "We provide the get_column_data_type method to query the final data type of each column:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'int'"
            ]
          },
          "execution_count": 15,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "loan_metadata.get_column_data_type(\"f0\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'float'"
            ]
          },
          "execution_count": 16,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "loan_metadata.get_column_data_type(\"recircle_u\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you need to get the exact data type of all columns, you can combine it with the `.column_list` method:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "loan_id: id\n",
            "user_id: int\n",
            "total_loan: int\n",
            "year_of_loan: int\n",
            "interest: float\n",
            "monthly_payment: float\n",
            "class: discrete\n",
            "sub_class: discrete\n",
            "work_type: discrete\n",
            "employer_type: discrete\n",
            "industry: discrete\n",
            "work_year: discrete\n",
            "house_exist: int\n",
            "house_loan_status: int\n",
            "censor_status: int\n",
            "marriage: int\n",
            "offsprings: int\n",
            "issue_date: datetime\n",
            "use: int\n",
            "post_code: int\n",
            "region: int\n",
            "debt_loan_ratio: float\n",
            "del_in_18month: int\n",
            "scoring_low: int\n",
            "scoring_high: int\n",
            "pub_dero_bankrup: int\n",
            "early_return: int\n",
            "early_return_amount: int\n",
            "early_return_amount_3mon: int\n",
            "recircle_b: int\n",
            "recircle_u: float\n",
            "initial_list_status: int\n",
            "earlies_credit_mon: datetime\n",
            "title: int\n",
            "policy_code: int\n",
            "f0: int\n",
            "f1: int\n",
            "f2: int\n",
            "f3: int\n",
            "f4: int\n",
            "f5: int\n",
            "is_default: int\n"
          ]
        }
      ],
      "source": [
        "for each_col in loan_metadata.column_list:\n",
        "    print(f'{each_col}: {loan_metadata.get_column_data_type(each_col)}')"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "sdg",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
